April 1, 1973
A Proposal for Speech Understanding Research
It is proposed that the work on speech recognition that is
now under way in the A.I. project at Stanford University be continued
and extended as a separate project with broadened aims in the field
of speech understanding.
It is further proposed that this work be more closely tied to
the ARPA Speech Understanding Research groups than it has been in the
past and that it have as its express aim the study and application to
speech recognition of a machine learning process that has proved
highly successful in another application and that has already been
tested out to a limited extent in speech recognition. The machine
learning process offers both an automatic training scheme and the
inherent ability of the system to adapt to various speakers and
dialects. Speech recognition via machine learning represents a global
approach to the speech recognition problem and can be incorporated
into a wide class of limited vocabulary systems. Ultimately we would
like to have a system capable of understanding speech from an
unlimited domain of discourse and with unknown speakers. It seems not
unreasonable to expect the system to deal with this situation very
much as people do when they adapt their understanding processes to
the speaker's idiosyncrasies during the conversation.
With so much of the current work on speech understanding
being devoted to the development of systems designed to work in a
limited field of discourse and with a limited number of speakers, it
seems desirable for a minimal program to be continued that is not so
restricted. It is felt that we should not lose sight of those aspects
of the problem that are for the moment peripheral to the immediate
aims of developing the best complete system that can currently be
built. Stanford University is well suited as the site for such work,
having both the facilities for this work and a staff of people with
experience and interest in machine learning, phonetic analysis, and
digital signal processing.
The initial thrust of the proposed work would be toward the
development of adaptive learning techniques, using the signature
table method and some more recent variants and extensions of this
basic procedure. We have already demonstrated the usefulness of this
method for the initial assignment of significant features to the
acoustic signals. One of the next steps will be to extend the method
to include acoustic-phonetic probabilities in the decision process.
Finally we would hope to take account of syntactical and semantic
constraints in a somewhat analogous fashion.
Still another aspect to be studied would be the amount of
preprocessing that should be done and the desired balance between
bottom-up and top-down approaches. It is fairly obvious that
decisions of this sort should ideally be made adaptively depending
upon the familiarity of the system with the current domain of
discourse and with the characteristics of the current speaker.
Compromises will undoubtedly have to be made in any immediately
realizable system but we should understand better than we now do the
limitations on the system that such compromises impose.
Finally, we would propose accepting responsibility for keeping
other related projects supplied with operating versions of the best
current programs that we have developed for interfacing the digitized
speech, or a frequency-domain representation of this input, to the
rest of the overall system.
It may be well at this point to describe the general
philosophy that has been followed in the work that is currently under
way and the results that have been achieved to date. We have been
studying elements of a speech recognition system that is not
dependent upon the use of a limited vocabulary and that can recognize
continuous speech by a number of different speakers.
Such a system should be able to function successfully either
without any previous training for the specific speaker in question or
after a short training session in which the speaker would be asked to
repeat certain phrases designed to train the system on those phonetic
utterances that seemed to depart from the previously learned norm. In
either case it is believed that some automatic or semi-automatic
training system should be employed to acquire the data that is used
for the identification of the phonetic information in the speech. We
believe that this can best be done by employing a modification of the
signature table scheme previously described. A brief review of this
earlier form of signature table is given in Appendix 1.
The over-all system is envisioned as one in which the more or
less conventional method is used of separating the input speech into
short time slices, for each of which some sort of frequency analysis
(homomorphic, LPC, or the like) is done. We then interpret this
information in terms of significant features by means of a set of
signature tables. At this point we define longer sections of the
speech, called EVENTS, which are obtained by grouping together
varying numbers of the original slices on the basis of their
similarity. This
then takes the place of other forms of initial segmentation. Having
identified a series of EVENTS in this way we next use another set of
signature tables to extract information from the sequence of events
and combine it with a limited amount of syntactic and semantic
information to define a sequence of phonemes.
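To make the slice-grouping step just described concrete, a minimal
sketch in present-day Python is given below; the feature vectors, the
distance measure, and the threshold are illustrative assumptions
rather than the actual program.

    # Sketch: group fixed-length time slices into variable-length EVENTS
    # by similarity of adjacent feature vectors (assumes at least one
    # slice; all names and the threshold value are illustrative).
    def group_slices_into_events(slices, threshold=0.2):
        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        events, current = [], [slices[0]]
        for s in slices[1:]:
            if distance(s, current[-1]) <= threshold:
                current.append(s)       # similar: extend the current event
            else:
                events.append(current)  # dissimilar: close out, start anew
                current = [s]
        events.append(current)
        return events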
Signature tables can be used to perform four essential
functions that are required in the automatic recognition of speech.
These functions are: (1) the elimination of superfluous and
redundant information from the acoustic input stream, (2) the
transformation of the remaining information from one coordinate
system to a more phonetically meaningful coordinate system, (3) the
mixing of acoustically derived data with syntactic, semantic and
linguistic information to obtain the desired recognition, and (4) the
introduction of a learning mechanism.
The following three advantages emerge from this method of
training and evaluation.
1) Essentially arbitrary inter-relationships between the
input terms are taken into account by any one table. The only loss of
accuracy is in the quantization.
2) The training is a very simple process of accumulating
counts. The training samples are introduced sequentially, and hence
simultaneous storage of all the samples is not required.
3) The process linearizes the storage requirements in the
parameter space.
The signature tables, as used in speech recognition, must be
particularized to allow for the multi-category nature of the output.
Several forms of tables have been investigated. Details of the current
system are given in Appendix 2. Some results are summarized in an
attached report.
Work is currently under way on a major refinement of the
signature table approach which adopts a somewhat more rigorous
procedure. Preliminary results with this scheme indicate that a
substantial improvement has been achieved.
Appendix 1
The early form of a signature table
For those not familiar with signature tables as used by
Samuel in programs which played the game of checkers, the concept is
best illustrated (Fig. 1) by an arrangement of tables used in the
program. There are 27 input terms. Each term evaluates a specific
aspect of a board situation and is quantized into a limited but
adequate range of values (7, 5, and 3 in this case). The
terms are divided into 9 sets with 3 terms each, forming the 9 first
level tables. Outputs from the first level tables are quantized to 5
levels and combined into 3 second level tables and, finally, into one
third-level table whose output represents the figure of merit of the
board in question.
A signature table has an entry for every possible combination
of the input vector. Thus there are 7*5*3 or 105 entries in each of
the first level tables. Training consists of accumulating two counts
for each entry during a training sequence. Count A is incremented
when the current input vector represents a preferred move and count D
is incremented when it is not the preferred move. The output from the
table is computed as a correlation coefficient C = (A - D)/(A + D).
The figure of merit for a
board is simply the coefficient obtained as the output from the final
table.
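As a concrete illustration of this training rule, the following
sketch in Python accumulates the two counts for one first-level table
and computes the output coefficient; the class and method names are
illustrative, not the original checkers code.

    # One first-level signature table: three quantized input terms with
    # ranges 7, 5, and 3 give 7*5*3 = 105 possible entries.
    class SignatureTable:
        def __init__(self):
            self.counts = {}                    # input vector -> [A, D]

        def train(self, inputs, preferred):
            entry = self.counts.setdefault(tuple(inputs), [0, 0])
            entry[0 if preferred else 1] += 1   # bump count A or count D

        def output(self, inputs):
            a, d = self.counts.get(tuple(inputs), (0, 0))
            return 0.0 if a + d == 0 else (a - d) / (a + d)  # (A-D)/(A+D)

The coefficient C lies between -1 and +1 and would itself be quantized
(to 5 levels for the first-level tables) before being passed on as an
input to the next level.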
Appendix 2
Initial Form of Signature Table for Speech Recognition
The signature tables, as used in speech recognition, must be
particularized to allow for the multi-category nature of the output.
Several forms of tables have been investigated. The initial form
tested and used for the data presented in the attached paper uses
tables consisting of two parts, a preamble and the table proper. The
preamble contains: (1) space for saving a record of the current and
recent output reports from the table, (2) identifying information as
to the specific type of table, (3) a parameter that identifies the
desired output from the table and that is used in the learning
process, (4) a gating parameter specifying the input that is to be
used to gate the table, (5) the gating level to be used, and (6)
parameters that identify the sources of the normal inputs to the
table.
All inputs are limited in range and specify either the
absolute level of some basic property or, more usually, the probability
of some property being present. These inputs may be from the original
acoustic input or they may be the outputs of other tables. If from
other tables they may be for the current time step or for earlier
time steps (subject to practical limits on the number of time
steps that are saved).
The output, or outputs, from each table are similarly limited
in range and specify, in all cases, a probability that some
particular significant feature, phonette, phoneme, word segment, word
or phrase is present.
We are limiting the range of inputs and outputs to values
specified by 3 bits and the number of entries per table to 64,
although this choice of values is a matter to be determined by
experiment. We are also providing for any of the following input
combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,
(3) three inputs of 2 bits each, and (4) six inputs of 1 bit each.
The uses to which these different forms are put will be described
later.
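In each of these combinations the inputs together supply 6 bits, so
that any of them can be packed into a single index selecting one of
the 64 entry lines. A minimal sketch in Python follows; the function
name and calling convention are illustrative.

    # Pack a list of small input values, each with a stated bit width,
    # into one 6-bit index (0..63) addressing a 64-entry table.
    def pack_index(inputs, widths):
        assert sum(widths) == 6, "table entries are addressed by 6 bits"
        index = 0
        for value, width in zip(inputs, widths):
            assert 0 <= value < (1 << width)
            index = (index << width) | value
        return index

    # e.g. two 3-bit inputs:  pack_index([5, 2], [3, 3])        -> 42
    #      six 1-bit inputs:  pack_index([1, 0, 1, 1, 0, 0], [1] * 6)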
The body of each table contains entries corresponding to
every possible combination of the allowed input parameters. Each
entry in the table actually consists of several parts. There are
fields assigned to accumulate counts of the occurrences of incidents
in which the specifying input values coincided with the different
desired outputs from the table, as found during previous learning
sessions, and there are fields containing the summarized results of
these learning sessions, which are used as outputs from the table.
The outputs from the tables can then express to the allowed accuracy
all possible functions of the input parameters.
Operation in the Training Mode
When operating in the training mode the program is supplied
with a sequence of stored utterances with accompanying phonetic
transcriptions. Each segment of the incoming speech signal is
analyzed (Fourier transforms or inverse filter equivalent) to obtain
the necessary input parameters for the lowest level tables in the
signature table hierarchy. At the same time reference is made to a
table of phonetic "hints" which prescribes the desired outputs from
each table for all possible phonemic inputs. The
signature tables are then processed.
The processing of each table is done in two steps, the first
performed each time the table is consulted and the second only
periodically.
The first process consists of locating a single entry line within the
table as specified by the inputs to the table and adding a 1 to the
appropriate field to indicate the presence of the property specified
by the hint table as corresponding to the phoneme specified in the
phonemic transcription. At this time a report is also made as to the
table's output as determined from the averaged results of previous
learning so that a running record may be kept of the performance of
the system. At periodic intervals all tables are updated to
incorporate recent learning results. To make this process easily
understandable, let us restrict our attention to a table used to
identify a single significant feature, say voicing. The hint table
will identify whether or not the phoneme currently being processed is
to be considered voiced. If it is voiced, a 1 is added to the "yes"
field of the entry line located by the normal inputs to the table. If
it is not voiced, a 1 is added to the "no" field. At updating time
the output that this entry will subsequently report is determined by
dividing the accumulated sum in the "yes" field by the sum of the
numbers in the "yes" and the "no" fields, and reporting this quantity
as a number in the range from 0 to 7. Actually the process is a bit
more complicated than this and it varies with the exact type of table
under consideration, as reported in detail in Appendix B. Outputs
from the signature tables are not probabilities, in the strict sense,
but are the statistically-arrived-at odds based on the actual
learning sequence.
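The yes/no accumulation and the periodic updating just described
might be sketched as follows; this is a deliberately simplified
single-feature version (a real table carries several output
categories, and the exact odds computation varies with the table
type).

    # Sketch of training one binary-feature table (e.g. voicing).
    class FeatureTable:
        def __init__(self, n_entries=64):
            self.yes = [0] * n_entries
            self.no = [0] * n_entries
            self.out = [0] * n_entries        # reported output, 0..7

        def train(self, entry, hint_says_voiced):
            if hint_says_voiced:
                self.yes[entry] += 1          # add 1 to the "yes" field
            else:
                self.no[entry] += 1           # add 1 to the "no" field

        def update(self):
            # periodic pass: output = yes / (yes + no), scaled to 0..7
            for i in range(len(self.out)):
                total = self.yes[i] + self.no[i]
                if total:
                    self.out[i] = round(7 * self.yes[i] / total)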
The preamble of the table has space for storing twelve past
outputs. An input to a table can be delayed to that extent. This
table relates the outcomes of previous events to the present hint,
the learning input. A certain amount of context-dependent learning is
thus possible, with the limitation that the specified delays are
constant.
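One way to picture this storage is as a short queue of past outputs
kept in each table's preamble; the sketch below assumes only the
stated capacity of twelve, with everything else illustrative.

    from collections import deque

    # Each table keeps its last twelve outputs; a consuming table may
    # read the value delayed by any fixed number of steps up to 12.
    history = deque(maxlen=12)

    def record(output):
        history.appendleft(output)        # newest output at position 0

    def delayed(steps):
        # steps = 0 is the current output, steps = 11 the oldest stored
        return history[steps] if steps < len(history) else None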
The interconnected hierarchy of tables forms a network which
runs incrementally, in steps synchronous with the time window over
which the input signal is analyzed. The present window width is set
at 12.8 ms (256 points at 20K samples/sec) with an overlap of 6.4 ms.
Inputs to this network are the parameters abstracted from the
frequency analyses of the signal, and the specified hint. The outputs
of the network could be either the probability attached to every
phonetic symbol or the output of a table associated with a feature
such as voiced, vowel, etc. The point to be made is that the output
generated for a segment is essentially independent of its contiguous
segments. The dependency achieved by using delays in the inputs is
invisible in the outputs. The outputs thus report the best estimate
of what the current acoustic input is, with no relation to past
outputs. Relating the successive outputs along the time dimension is
realized by counters.
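For concreteness, the stated framing (256-point windows at 20K
samples/sec, advanced by half a window) amounts to the following; the
generator below is an illustration, not the actual analysis code.

    # 12.8 ms windows (256 samples at 20,000 samples/sec), 6.4 ms overlap.
    def frames(signal, width=256, hop=128):
        for start in range(0, len(signal) - width + 1, hop):
            yield signal[start:start + width]   # one analysis window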
The Use of COUNTERS
The transition from the initial segment space to event space is
made possible by means of COUNTERS, which are summed and reinitiated
whenever their inputs cross specified threshold values, being
triggered on when the input exceeds the threshold and off when it
falls below. Momentary spikes are eliminated by specifying a time
hysteresis, the number of consecutive segments for which the input
must be above the threshold. The output of a counter provides
information about the starting time, duration, and average input for
the period it was active.
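Such a counter might be sketched as follows; the bookkeeping shown (a
list of completed spans) is an assumption made for illustration.

    # Sketch of a COUNTER: triggers on when its input has exceeded the
    # threshold for `hysteresis` consecutive segments, and off when the
    # input falls below; while active it accumulates its input values.
    def run_counter(inputs, threshold, hysteresis=2):
        spans, pending, active = [], [], None   # active = (start, values)
        for t, x in enumerate(inputs):
            if active is None:
                if x > threshold:
                    pending.append(x)
                    if len(pending) >= hysteresis:    # spike-proof turn-on
                        active = (t - len(pending) + 1, pending)
                        pending = []
                else:
                    pending = []          # momentary spike: start over
            elif x > threshold:
                active[1].append(x)
            else:                         # turn-off: record the span
                start, values = active
                spans.append((start, len(values), sum(values) / len(values)))
                active = None
        if active is not None:            # signal ended while active
            start, values = active
            spans.append((start, len(values), sum(values) / len(values)))
        return spans   # (starting time, duration, average input) per span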
Since a counter can reference a table at any level in the
hierarchy of tables, it can reflect any desired degree of information
reduction. For example, a counter may be set up to show a section of
speech to be a vowel, a front vowel, or the vowel /I/. The counters
can be looked upon as representing a mapping of parameter-time space
into a feature-time space or, at a higher level, a symbol-time space.
It may be useful to carry along the feature information as a backup
in those situations where the symbolic information is not acceptable
to syntactic or semantic interpretation.
In the same manner as the tables, the counters run completely
independently of each other. In a recognition run the counters may
overlap in arbitrary fashion, may leave gaps where no counter has
been triggered, or may not line up nicely. A properly segmented
output, where the consecutive sections are in time sequence and are
neatly labeled, is essential for further processing. This is achieved by
registering the instants when the counters are triggered or
terminated to form time segments called events.
An event is the period between successive activation or
termination of any counter. An event shorter than a specified time is
merely ignored. A record of event durations and up to three active
counters, ordered according to their probability, is maintained.
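The event-forming step can thus be pictured as collecting all counter
on/off instants and cutting the time axis at each one, as in this
simplified sketch (the span format and the minimum duration are
illustrative):

    # Form events by cutting time at every counter activation or
    # termination. spans: (start, duration) pairs from all counters.
    def events_from_spans(spans, min_duration=2):
        cuts = sorted({t for start, dur in spans
                         for t in (start, start + dur)})
        return [(a, b) for a, b in zip(cuts, cuts[1:])
                if b - a >= min_duration]   # too-short events are ignored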
An event resulting from the processing described so far
represents a phonette, one of the basic speech categories defined as
hints in the learning process. It is only an estimate of closeness to
a speech category, based on past learning. Also, each category has a
more-or-less stationary spectral characterization. Thus a category
may have a phonemic equivalent, as in the case of vowels; it may be
common to a phoneme class, as for the voiced or unvoiced stop gaps;
or it may be subphonemic, as a T-burst or a K-burst. The choices are
based on acoustic expediency, i.e., optimization of the learning,
rather than on any linguistic considerations. However, higher-level
interpretive programs may best operate on inputs resembling a
phonemic transcription. Contiguous events may be coalesced into
phoneme-like units using dyadic or triadic probabilities and
acoustic-phonetic rules particular to the system. For example, a
period of silence followed by a type of burst or a short friction may
be combined to form the corresponding stop. A short friction or a
burst following a nasal or a lateral may be called a stop even if the
silence period is short or absent. Clearly these rules must be
specific to the system, based on the confidence with which durations
and phonette categories are recognized.
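A couple of the rules just mentioned might be sketched as follows;
the category names and rule details are illustrative stand-ins for
whatever rule set the system's confidences would actually justify.

    # Sketch: coalesce contiguous phonette events into phoneme-like
    # units. Each event is a (category, duration) pair.
    def coalesce(events):
        units, i = [], 0
        while i < len(events):
            cat = events[i][0]
            nxt = events[i + 1][0] if i + 1 < len(events) else None
            if cat == "silence" and nxt in ("T-burst", "K-burst", "friction"):
                units.append("stop")            # silence + burst/friction
                i += 2
            elif cat in ("nasal", "lateral") and \
                 nxt in ("T-burst", "K-burst", "friction"):
                units.extend([cat, "stop"])     # stop gap short or absent
                i += 2
            else:
                units.append(cat)
                i += 1
        return units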
While it would be possible to extend this bottom-up approach
still further, it seems reasonable to break off at this point and
revert to a top-down approach from here on. The real difference in
the overall system would then be that the top-down analysis would
deal with the outputs from the signature table section as its
primitives rather than with the outputs from the initial measurements
either in the time domain or in the frequency domain. In the case of
inconsistencies the system could either refer to the second choices
retained within the signature tables or if need be could always go
clear back to the input parameters. The decision as to how far to
carry the initial bottom-up analysis must depend upon the relative
cost of this analysis, both in complexity and in processing time, and
the certainty with which it can be performed, as compared with the
costs associated with the rest of the analysis and the certainty with
which it can be performed, taking due notice of the costs in time of
recovering from false starts.